October 30, 2018

Geospatial analyses and epidemiology

John Snow, Map of Cholera deaths in relation to water Pumps, London, 1854

Basics of Mapping GIS data in R

R can be a powerful and accesible way to do geospatial analyses. In general, there are two type of spatial data.

Vector data

Vector data can be points (e.g. household locations in a survey), lines (e.g. roads or rivers), or polygons (e.g. country level administrative boundaries). These primarily come in the form of ‘shapefiles’.

Raster data

An area gridded into cells at some spatial resolution each contatining a value (can be categorical or continuous):

Today

For today, we will use administrative shapefiles (polygons) at the district level for the two Indian States for which we have data for and raster data from World Pop.

We will learn how to: 1. Read in and write out vector + raster data 2. Extract raster data to polygons 3. Aggregate data spatially and map

Set-up

These are the libraries we’ll need to install for the tutorial. There are many different packages to work with geospatial data in R, but the ones we’ll use today are rgdal, sf, and raster.

# install these if not already done
# install.packages(c("rgdal", "sf", "raster", "tidyverse", "magrittr"))
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.4-6, (SVN revision 841)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.4.2, released 2019/06/28
##  Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/gdal
##  GDAL binary built with GEOS: FALSE 
##  Loaded PROJ.4 runtime: Rel. 5.2.0, September 15th, 2018, [PJ_VERSION: 520]
##  Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/proj
##  Linking to sp version: 1.3-1
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(raster)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks raster::extract()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ dplyr::select()  masks raster::select()
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following object is masked from 'package:raster':
## 
##     extract
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(patchwork)
## 
## Attaching package: 'patchwork'
## The following object is masked from 'package:raster':
## 
##     area

Reading in and writing out shapefiles

To read in shapefiles, we will use the readOGR function in the rgdal package. We will read in the administrative boundary 2 (district level) shapefile for the country of India (we’ll call it ‘india_shp’ for now).

One thing you might notice is that there are many different files with different extensions in the folder data/India where the national shapefile is stored.

Reading in and writing out shapefiles

These files must be stored together and are necessary to read in the shapefile, but the one you select is the file with the extension ‘.shp’.

india_shp <- readOGR("data/India/India.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/mrajeev/Documents/Projects/mapping-chennai/data/India/India.shp", layer: "India"
## with 666 features
## It has 16 fields

Subetting to our states of interest

This is a really big file and as we are only working with data from two states, Assam and Punjab, we can subset these spatial data, just as you would a regular data frame. If you look at the data frame (try head and names), then you will see that the field for the state level is name_1. We want to subset this to Punjab and Assam.

punjab <- subset(india_shp, name_1 == "Punjab")
plot(punjab, main = "Punjab State")

Subetting to our states of interest

Now try the same for Assam:

Subetting to our states of interest

Now try the same for Assam:

assam <- subset(india_shp, name_1 == "Assam")
plot(assam, main = "Assam State")

Writing shapefiles

If you want to save shapefiles you’ve made, you can write out the shapefile using the writeOGR function. You need to specify the object you want to write, the name of the folder where the files will go, the name of the file, and the driver. When writing shapefiles, in general you can use ‘ESRI Shapefile’ as the driver.

writeOGR(punjab, "output/Punjab", "Punjab", driver = "ESRI Shapefile", overwrite = TRUE)

Reading in and writing out raster data

The other spatial data we will work with today is high resolution estimates of human populations from the World Pop Project (you can read more about how these data are created and what else is available at http://www.worldpop.org.uk). This data gives you the number of people per 100 x 100m grid cell, but I’ve aggregated it up to 10 x 10 km km for today so that it’s easier to work with (checkout the raster::aggregate function if you want to try it yourself).

We will use the raster function to read in the raster file:

pop <- raster("data/india_pop.tif")
names(pop) <- "pop"
plot(pop)

Reading in and writing out raster data

If you want to write out the raster data:

writeRaster(pop, "output/pop.tif")

Extracting raster data to polygons

We can extract this data to our admin shapefiles for Assam and Punjab to get the estimated human population per district for each state. We do that using the crop, mask, and extract functions from the raster package. Note of caution: this can be a very slow process with large shapefiles or high resolution rasters!

assam_pop <- crop(pop, assam) # First we crop our pop raster to Assam
assam_pop <- mask(assam_pop, assam) # Then we have to mask this so that all values outside of Assam are set to NA

Extracting raster data to polygons

Try it for Punjab now:

Extracting raster data to polygons

Try it for Punjab now:

# Try it for Punjab now:
punjab_pop <- crop(pop, punjab) 
punjab_pop <- mask(punjab_pop, punjab) 

Visual check

We can plot to make sure it worked:

par(mfrow = c(1, 2), mar = c(2, 2, 2, 2))
plot(punjab_pop, main = "Human pop Punjab")
plot(punjab, add = TRUE)
plot(assam_pop,  main = "Human pop Assam")
plot(assam, add = TRUE)

Extracting data

Now we will extract these data, which basically means we will take all the grid cells that fall within a polygon and calculate a summary statistic. In our case we want to get the sum of values (i.e. the number of people living in each district).

assam <- extract(assam_pop, assam, fun = sum, # we take the sum of the grid cells in each district
                 na.rm = TRUE, # grid cells with NA pops are ignored
                 sp = TRUE) # gives us back a spatial polygons data frame

Extracting data

Try it for Punjab now:

Extracting data

Try it for Punjab now:

punjab <- extract(punjab_pop, punjab, fun = sum, na.rm = TRUE, sp = TRUE)

Aggregating serology data

We don’t actually have spatial data available, so I’ve assigned a district randomly to each individual in the survey. We can now aggregate this data to the district level. A new package called sf facilitates using the tidyverse (i.e. dplyr + ggplot2) for spatial data. For the rest of the tutorial we’ll use this package to make data manipulation and plotting easier.

Aggregating serology data

First we need to convert our spatial data frames to ‘sf’ objects:

# convert using st_as sf function
punjab_sf <- st_as_sf(punjab)
## Warning in fun(libname, pkgname): rgeos: versions of GEOS runtime 3.8.0-CAPI-1.13.1
## and GEOS at installation 3.7.2-CAPI-1.11.2differ
assam_sf <- st_as_sf(assam)

sf (simple features) objects

Check out the sf objects:

class(punjab_sf)
## [1] "sf"         "data.frame"

sf (simple features) objects

Check out the sf objects:

head(punjab_sf)
## Simple feature collection with 6 features and 17 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 73.87089 ymin: 29.77701 xmax: 76.58119 ymax: 32.04543
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##     iso admn_lv name_0     id_0  type_0 name_1     id_1 type_1          name_2
## 465 IND       2  India 10000881 Country Punjab 10315044  State        Amritsar
## 466 IND       2  India 10000881 Country Punjab 10315044  State        Faridkot
## 467 IND       2  India 10000881 Country Punjab 10315044  State         Barnala
## 471 IND       2  India 10000881 Country Punjab 10315044  State        Bathinda
## 474 IND       2  India 10000881 Country Punjab 10315044  State Fatehgarh Sahib
## 476 IND       2  India 10000881 Country Punjab 10315044  State         Fazilka
##         id_2   type_2 name_3 id_3 type_3    source cntry_l       pop
## 465 10019070 District     NA   NA     NA GADM v2.8   IND_2 2459497.1
## 466 10019073 District     NA   NA     NA GADM v2.8   IND_2  669828.9
## 467 10019071 District     NA   NA     NA GADM v2.8   IND_2  699785.0
## 471 10019072 District     NA   NA     NA GADM v2.8   IND_2 1500438.6
## 474 10019074 District     NA   NA     NA GADM v2.8   IND_2  652263.9
## 476 10019075 District     NA   NA     NA GADM v2.8   IND_2 1092486.0
##                           geometry
## 465 MULTIPOLYGON (((74.92346 32...
## 466 MULTIPOLYGON (((74.89951 30...
## 467 MULTIPOLYGON (((75.71488 30...
## 471 MULTIPOLYGON (((75.17096 30...
## 474 MULTIPOLYGON (((76.52028 30...
## 476 MULTIPOLYGON (((74.2091 30....

They are data frames with geometries stored as a single column in a list.

Plotting sf objects

If you use the plot function on an sf object, you need to specify which variable to plot by:

# plot using sf notation--lets just do the pop
plot(punjab_sf["pop"]) # see the help on plot_sf for more options on customization

If you don’t specify, you end up trying to plot all the attributes, which can be slow and messy!

Plotting sf objects

You can also use ggplot (I prefer this):

# you can also use ggplot
(ggplot(assam_sf) + geom_sf(aes(fill = pop))) + (ggplot(punjab_sf) + geom_sf(aes(fill = pop))) 

Aggregating data to district level

We can read in our serodata from earlier and we will change the qualitative results to 0 for positive and 1 for negative to make summarizing easier:

df.eia <- read_csv("data/serodata_spatial.csv")
## Parsed with column specification:
## cols(
##   SITEID = col_character(),
##   AGE = col_double(),
##   REVQUAL_M = col_character(),
##   REVQUAL_R = col_character(),
##   district = col_character()
## )
df.eia %>% # change so that results are 0/1 to more easily aggregate
  mutate(REVQUAL_M = ifelse(REVQUAL_M =="Positive", 1, 0),
         REVQUAL_R = ifelse(REVQUAL_R =="Positive", 1, 0)) -> df.eia 

Aggregating data to district level

We will now calculate summary statistics for each variable

df.eia %>%
  filter(SITEID == "Assam") %>% # filter for Assam
  group_by(district) %>% # group by district
  summarize(sample_size= n(), pos_measles = sum(REVQUAL_M), 
            pos_rub = sum(REVQUAL_R),
            avg_age_sampled = mean(AGE), 
            avg_age_posM = sum(AGE*REVQUAL_M)/pos_measles,
            avg_age_posR = sum(AGE*REVQUAL_R)/pos_rub) %>%
  right_join(assam_sf, by = c("district" = "name_2")) %>%
  st_as_sf() -> assam_dist
## Warning: Column `district`/`name_2` joining character vector and factor,
## coercing into character vector

Aggregating data to district level

Now try for Punjab:

Aggregating data to district level

Now try for Punjab:

# Do the same for punjab to get the same summarized vars for each district
df.eia %>%
  filter(SITEID == "Punjab") %>% # filter for Assam
  group_by(district) %>% # group by district
  summarize(sample_size= n(), pos_measles = sum(REVQUAL_M), 
            pos_rub = sum(REVQUAL_R),
            avg_age_sampled = mean(AGE), 
            avg_age_posM = sum(AGE*REVQUAL_M)/pos_measles,
            avg_age_posR = sum(AGE*REVQUAL_R)/pos_rub) %>%
  right_join(punjab_sf, by = c("district" = "name_2")) %>%
  st_as_sf() -> punjab_dist
## Warning: Column `district`/`name_2` joining character vector and factor,
## coercing into character vector

Mapping data

Now we can visualize our serodata spatially. Lets map the proportion seropositive for measles by district for each state:

# Map the average age of sampled + infection 
assam_dist %>%
  ggplot() + geom_sf(aes(fill = pos_measles/sample_size)) + 
  scale_fill_continuous(name = "Proportion \n seropositive: \n measles", 
                        type = "viridis")

punjab_dist %>%
  ggplot() + geom_sf(aes(fill = pos_measles/sample_size)) + 
  scale_fill_continuous(name = "Proportion \n seropositive: \n measles", 
                        type = "viridis")

Play around plotting other variables

# Map the average age of sampled + infection for rubella for the state of assam
assam_dist %>%
  ggplot() + geom_sf(aes(fill = avg_age_sampled)) + 
  scale_fill_continuous(name = "Average \n age sampled", type = "viridis")

Excercises

While we don’t have true spatially resolved data at the district level, we will use the data that we generated to practice what we learned earlier today. Note: this is just to learn how to work with data rather than a question driven modeling approach (which is always preferrable;)

Excercise 1: extract births by district

In the data folder is a raster dataset of births at a 10 x 10 km km grid for India taken from World Pop. Extract this data to the district level for Assam and Punjab (you will need to use the original spatial polygon data frames, i.e. punjab not punjab_sf)

Excercise 1: extract births by district

births <- raster("data/india_births.tif")
names(births) <- "births"
punjab_births <- crop(births, punjab) 
punjab_births <- mask(punjab_births, punjab) 
punjab <- extract(punjab_births, punjab, fun = sum, na.rm = TRUE, sp = TRUE)

assam_births <- crop(births, assam) 
assam_births <- mask(assam_births, assam) 
assam <- extract(assam_births, assam, fun = sum, na.rm = TRUE, sp = TRUE)

Excercise 2: glm on spatial data (seropositivity ~ birth rate)

Explore the relationship between seropositivity (the response variable) at the district level and the per capita birthrate (the predictor variable, calculate as the number of births/population) for rubella for Punjab.

Hint: you will want to use family = binomial, but this time your data will follow the following formula: glm(cbind(successes, failures) ~ predictor, family = binomial) where the successes = # of seropositives and failures = # of seronegatives.

Excercise 2: glm on spatial data (seropositivity ~ birth rate)

punjab_sf <- st_as_sf(punjab)

df.eia %>%
  filter(SITEID == "Punjab") %>% # filter for Punjab
  group_by(district) %>% # group by district
  summarize(sample_size= n(), pos_measles = sum(REVQUAL_M), 
            pos_rub = sum(REVQUAL_R),
            avg_age_sampled = mean(AGE), 
            avg_age_posM = sum(AGE*REVQUAL_M)/pos_measles,
            avg_age_posR = sum(AGE*REVQUAL_R)/pos_rub) %>%
  right_join(punjab_sf, by = c("district" = "name_2")) -> punjab_dist
## Warning: Column `district`/`name_2` joining character vector and factor,
## coercing into character vector

Excercise 2: glm on spatial data (seropositivity ~ birth rate)

punjab_dist %$%
  glm(cbind(pos_rub, sample_size - pos_rub) ~ births/pop, family = binomial)
## 
## Call:  glm(formula = cbind(pos_rub, sample_size - pos_rub) ~ births/pop, 
##     family = binomial)
## 
## Coefficients:
## (Intercept)       births   births:pop  
##   2.615e-01   -1.932e-05    2.896e-12  
## 
## Degrees of Freedom: 21 Total (i.e. Null);  19 Residual
## Null Deviance:       33.93 
## Residual Deviance: 32.15     AIC: 119.1

Bonus 1

Combine both Punjab and Assam datasets and add in state as a random effect!**

Hint: you will need to use the glmer function from the lme4 package.

Bonus 1

library(lme4)
assam_sf <- st_as_sf(assam)

df.eia %>%
  filter(SITEID == "Assam") %>% # filter for assam
  group_by(district) %>% # group by district
  summarize(sample_size= n(), pos_measles = sum(REVQUAL_M), 
            pos_rub = sum(REVQUAL_R),
            avg_age_sampled = mean(AGE), 
            avg_age_posM = sum(AGE*REVQUAL_M)/pos_measles,
            avg_age_posR = sum(AGE*REVQUAL_R)/pos_rub) %>%
  right_join(assam_sf, by = c("district" = "name_2")) -> assam_dist
## Warning: Column `district`/`name_2` joining character vector and factor,
## coercing into character vector
df_combined <- bind_rows(assam_dist, punjab_dist)
## Warning in bind_rows_(x, .id): Vectorizing 'sfc_MULTIPOLYGON' elements may not
## preserve their attributes
## Warning in bind_rows_(x, .id): Vectorizing 'sfc_MULTIPOLYGON' elements may not
## preserve their attributes
df_combined %>% mutate(birth_rate = births/pop) -> df_combined

glmer(cbind(pos_rub, sample_size - pos_rub) ~ birth_rate + (1 | name_1), family = binomial, data = df_combined)
## boundary (singular) fit: see ?isSingular
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(pos_rub, sample_size - pos_rub) ~ birth_rate + (1 | name_1)
##    Data: df_combined
##       AIC       BIC    logLik  deviance  df.resid 
##  244.4491  250.1246 -119.2246  238.4491        46 
## Random effects:
##  Groups Name        Std.Dev.
##  name_1 (Intercept) 0       
## Number of obs: 49, groups:  name_1, 2
## Fixed Effects:
## (Intercept)   birth_rate  
##     -0.7297      45.3317  
## convergence code 0; 1 optimizer warnings; 0 lme4 warnings

Bonus 2

I didn’t actually assign individuals randomly to districts! Can you figure out which variable I used to allocate sampled individuals (hint: all the data you have is the data I have:)?

Bonus 2

df_combined %$% plot(pop, sample_size, bty = "l", xlab = "District pop", ylab = "Number sampled")

Bonus 2

punjab$samples <- as.vector(rmultinom(n = 1, size = nrow(df.eia %>% filter(SITEID == "Punjab")), 
                                      prob = punjab$pop))
assam$samples <- as.vector(rmultinom(n = 1, size = nrow(df.eia %>% filter(SITEID == "Assam")), 
                                     prob = assam$pop))
alloc_punjab <- rep(punjab$name_2, punjab$samples)
alloc_assam <- rep(assam$name_2, assam$samples)

# The jumble function!
jumble <- function(x) {
  jumbled <- x[order(runif(length(x), min = 0, max = 1))]
  return(jumbled)
}

df.eia %>%
  arrange(SITEID) %>% # in ascending order with Assam 1st!
  mutate(district = c(as.character(jumble(alloc_assam)),
                      as.character(jumble(alloc_punjab)))) -> df.eia

Further resources for geospatial analyses in R

Packages used

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.0.0.9000 lme4_1.1-21          Matrix_1.2-17       
##  [4] magrittr_1.5         forcats_0.4.0        stringr_1.4.0       
##  [7] dplyr_0.8.3          purrr_0.3.2          readr_1.3.1         
## [10] tidyr_1.0.0          tibble_2.1.3         ggplot2_3.2.1       
## [13] tidyverse_1.2.1      raster_3.0-7         sf_0.8-0            
## [16] rgdal_1.4-6          sp_1.3-1             knitr_1.27          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3         lubridate_1.7.4    lattice_0.20-38    class_7.3-15      
##  [5] assertthat_0.2.1   zeallot_0.1.0      digest_0.6.23      R6_2.4.1          
##  [9] cellranger_1.1.0   backports_1.1.5    evaluate_0.14      e1071_1.7-2       
## [13] httr_1.4.1         pillar_1.4.2       rlang_0.4.4        lazyeval_0.2.2    
## [17] readxl_1.3.1       minqa_1.2.4        rstudioapi_0.10    nloptr_1.2.1      
## [21] rmarkdown_2.1      labeling_0.3       splines_3.6.1      munsell_0.5.0     
## [25] broom_0.5.2        compiler_3.6.1     modelr_0.1.5       xfun_0.12         
## [29] pkgconfig_2.0.3    rgeos_0.5-2        htmltools_0.4.0    tidyselect_0.2.5  
## [33] codetools_0.2-16   viridisLite_0.3.0  crayon_1.3.4       withr_2.1.2       
## [37] MASS_7.3-51.4      grid_3.6.1         nlme_3.1-141       jsonlite_1.6      
## [41] gtable_0.3.0       lifecycle_0.1.0    DBI_1.0.0          units_0.6-4       
## [45] scales_1.0.0       KernSmooth_2.23-15 cli_1.1.0          stringi_1.4.5     
## [49] xml2_1.2.2         vctrs_0.2.0        generics_0.0.2     boot_1.3-23       
## [53] tools_3.6.1        glue_1.3.1         hms_0.5.1          parallel_3.6.1    
## [57] yaml_2.2.0         colorspace_1.4-1   classInt_0.4-1     rvest_0.3.4       
## [61] haven_2.1.1